Bank-Customers-Segmentation-Analysis

Importing Library

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

Reading Data

In [2]:
data = pd.read_csv("data.csv")
print(data.shape)
data.head(3)
(6819, 96)
Out[2]:
Bankrupt? ROA(C) before interest and depreciation before interest ROA(A) before interest and % after tax ROA(B) before interest and depreciation after tax Operating Gross Margin Realized Sales Gross Margin Operating Profit Rate Pre-tax net Interest Rate After-tax net Interest Rate Non-industry income and expenditure/revenue ... Net Income to Total Assets Total assets to GNP price No-credit Interval Gross Profit to Sales Net Income to Stockholder's Equity Liability to Equity Degree of Financial Leverage (DFL) Interest Coverage Ratio (Interest expense to EBIT) Net Income Flag Equity to Liability
0 1 0.370594 0.424389 0.405750 0.601457 0.601457 0.998969 0.796887 0.808809 0.302646 ... 0.716845 0.009219 0.622879 0.601453 0.827890 0.290202 0.026601 0.564050 1 0.016469
1 1 0.464291 0.538214 0.516730 0.610235 0.610235 0.998946 0.797380 0.809301 0.303556 ... 0.795297 0.008323 0.623652 0.610237 0.839969 0.283846 0.264577 0.570175 1 0.020794
2 1 0.426071 0.499019 0.472295 0.601450 0.601364 0.998857 0.796403 0.808388 0.302035 ... 0.774670 0.040003 0.623841 0.601449 0.836774 0.290189 0.026555 0.563706 1 0.016474

3 rows × 96 columns

EDA

In [3]:
data.tail(3)
Out[3]:
Bankrupt? ROA(C) before interest and depreciation before interest ROA(A) before interest and % after tax ROA(B) before interest and depreciation after tax Operating Gross Margin Realized Sales Gross Margin Operating Profit Rate Pre-tax net Interest Rate After-tax net Interest Rate Non-industry income and expenditure/revenue ... Net Income to Total Assets Total assets to GNP price No-credit Interval Gross Profit to Sales Net Income to Stockholder's Equity Liability to Equity Degree of Financial Leverage (DFL) Interest Coverage Ratio (Interest expense to EBIT) Net Income Flag Equity to Liability
6816 0 0.472725 0.533744 0.520638 0.610444 0.610213 0.998984 0.797401 0.809317 0.303512 ... 0.797778 0.002840 0.624156 0.610441 0.840138 0.275789 0.026791 0.565158 1 0.097649
6817 0 0.506264 0.559911 0.554045 0.607850 0.607850 0.999074 0.797500 0.809399 0.303498 ... 0.811808 0.002837 0.623957 0.607846 0.841084 0.277547 0.026822 0.565302 1 0.044009
6818 0 0.493053 0.570105 0.549548 0.627409 0.627409 0.998080 0.801987 0.813800 0.313415 ... 0.815956 0.000707 0.626680 0.627408 0.841019 0.275114 0.026793 0.565167 1 0.233902

3 rows × 96 columns

In [4]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6819 entries, 0 to 6818
Data columns (total 96 columns):
 #   Column                                                    Non-Null Count  Dtype  
---  ------                                                    --------------  -----  
 0   Bankrupt?                                                 6819 non-null   int64  
 1    ROA(C) before interest and depreciation before interest  6819 non-null   float64
 2    ROA(A) before interest and % after tax                   6819 non-null   float64
 3    ROA(B) before interest and depreciation after tax        6819 non-null   float64
 4    Operating Gross Margin                                   6819 non-null   float64
 5    Realized Sales Gross Margin                              6819 non-null   float64
 6    Operating Profit Rate                                    6819 non-null   float64
 7    Pre-tax net Interest Rate                                6819 non-null   float64
 8    After-tax net Interest Rate                              6819 non-null   float64
 9    Non-industry income and expenditure/revenue              6819 non-null   float64
 10   Continuous interest rate (after tax)                     6819 non-null   float64
 11   Operating Expense Rate                                   6819 non-null   float64
 12   Research and development expense rate                    6819 non-null   float64
 13   Cash flow rate                                           6819 non-null   float64
 14   Interest-bearing debt interest rate                      6819 non-null   float64
 15   Tax rate (A)                                             6819 non-null   float64
 16   Net Value Per Share (B)                                  6819 non-null   float64
 17   Net Value Per Share (A)                                  6819 non-null   float64
 18   Net Value Per Share (C)                                  6819 non-null   float64
 19   Persistent EPS in the Last Four Seasons                  6819 non-null   float64
 20   Cash Flow Per Share                                      6819 non-null   float64
 21   Revenue Per Share (Yuan ¥)                               6819 non-null   float64
 22   Operating Profit Per Share (Yuan ¥)                      6819 non-null   float64
 23   Per Share Net profit before tax (Yuan ¥)                 6819 non-null   float64
 24   Realized Sales Gross Profit Growth Rate                  6819 non-null   float64
 25   Operating Profit Growth Rate                             6819 non-null   float64
 26   After-tax Net Profit Growth Rate                         6819 non-null   float64
 27   Regular Net Profit Growth Rate                           6819 non-null   float64
 28   Continuous Net Profit Growth Rate                        6819 non-null   float64
 29   Total Asset Growth Rate                                  6819 non-null   float64
 30   Net Value Growth Rate                                    6819 non-null   float64
 31   Total Asset Return Growth Rate Ratio                     6819 non-null   float64
 32   Cash Reinvestment %                                      6819 non-null   float64
 33   Current Ratio                                            6819 non-null   float64
 34   Quick Ratio                                              6819 non-null   float64
 35   Interest Expense Ratio                                   6819 non-null   float64
 36   Total debt/Total net worth                               6819 non-null   float64
 37   Debt ratio %                                             6819 non-null   float64
 38   Net worth/Assets                                         6819 non-null   float64
 39   Long-term fund suitability ratio (A)                     6819 non-null   float64
 40   Borrowing dependency                                     6819 non-null   float64
 41   Contingent liabilities/Net worth                         6819 non-null   float64
 42   Operating profit/Paid-in capital                         6819 non-null   float64
 43   Net profit before tax/Paid-in capital                    6819 non-null   float64
 44   Inventory and accounts receivable/Net value              6819 non-null   float64
 45   Total Asset Turnover                                     6819 non-null   float64
 46   Accounts Receivable Turnover                             6819 non-null   float64
 47   Average Collection Days                                  6819 non-null   float64
 48   Inventory Turnover Rate (times)                          6819 non-null   float64
 49   Fixed Assets Turnover Frequency                          6819 non-null   float64
 50   Net Worth Turnover Rate (times)                          6819 non-null   float64
 51   Revenue per person                                       6819 non-null   float64
 52   Operating profit per person                              6819 non-null   float64
 53   Allocation rate per person                               6819 non-null   float64
 54   Working Capital to Total Assets                          6819 non-null   float64
 55   Quick Assets/Total Assets                                6819 non-null   float64
 56   Current Assets/Total Assets                              6819 non-null   float64
 57   Cash/Total Assets                                        6819 non-null   float64
 58   Quick Assets/Current Liability                           6819 non-null   float64
 59   Cash/Current Liability                                   6819 non-null   float64
 60   Current Liability to Assets                              6819 non-null   float64
 61   Operating Funds to Liability                             6819 non-null   float64
 62   Inventory/Working Capital                                6819 non-null   float64
 63   Inventory/Current Liability                              6819 non-null   float64
 64   Current Liabilities/Liability                            6819 non-null   float64
 65   Working Capital/Equity                                   6819 non-null   float64
 66   Current Liabilities/Equity                               6819 non-null   float64
 67   Long-term Liability to Current Assets                    6819 non-null   float64
 68   Retained Earnings to Total Assets                        6819 non-null   float64
 69   Total income/Total expense                               6819 non-null   float64
 70   Total expense/Assets                                     6819 non-null   float64
 71   Current Asset Turnover Rate                              6819 non-null   float64
 72   Quick Asset Turnover Rate                                6819 non-null   float64
 73   Working capitcal Turnover Rate                           6819 non-null   float64
 74   Cash Turnover Rate                                       6819 non-null   float64
 75   Cash Flow to Sales                                       6819 non-null   float64
 76   Fixed Assets to Assets                                   6819 non-null   float64
 77   Current Liability to Liability                           6819 non-null   float64
 78   Current Liability to Equity                              6819 non-null   float64
 79   Equity to Long-term Liability                            6819 non-null   float64
 80   Cash Flow to Total Assets                                6819 non-null   float64
 81   Cash Flow to Liability                                   6819 non-null   float64
 82   CFO to Assets                                            6819 non-null   float64
 83   Cash Flow to Equity                                      6819 non-null   float64
 84   Current Liability to Current Assets                      6819 non-null   float64
 85   Liability-Assets Flag                                    6819 non-null   int64  
 86   Net Income to Total Assets                               6819 non-null   float64
 87   Total assets to GNP price                                6819 non-null   float64
 88   No-credit Interval                                       6819 non-null   float64
 89   Gross Profit to Sales                                    6819 non-null   float64
 90   Net Income to Stockholder's Equity                       6819 non-null   float64
 91   Liability to Equity                                      6819 non-null   float64
 92   Degree of Financial Leverage (DFL)                       6819 non-null   float64
 93   Interest Coverage Ratio (Interest expense to EBIT)       6819 non-null   float64
 94   Net Income Flag                                          6819 non-null   int64  
 95   Equity to Liability                                      6819 non-null   float64
dtypes: float64(93), int64(3)
memory usage: 5.0 MB
In [5]:
data.describe()
Out[5]:
Bankrupt? ROA(C) before interest and depreciation before interest ROA(A) before interest and % after tax ROA(B) before interest and depreciation after tax Operating Gross Margin Realized Sales Gross Margin Operating Profit Rate Pre-tax net Interest Rate After-tax net Interest Rate Non-industry income and expenditure/revenue ... Net Income to Total Assets Total assets to GNP price No-credit Interval Gross Profit to Sales Net Income to Stockholder's Equity Liability to Equity Degree of Financial Leverage (DFL) Interest Coverage Ratio (Interest expense to EBIT) Net Income Flag Equity to Liability
count 6819.000000 6819.000000 6819.000000 6819.000000 6819.000000 6819.000000 6819.000000 6819.000000 6819.000000 6819.000000 ... 6819.000000 6.819000e+03 6819.000000 6819.000000 6819.000000 6819.000000 6819.000000 6819.000000 6819.0 6819.000000
mean 0.032263 0.505180 0.558625 0.553589 0.607948 0.607929 0.998755 0.797190 0.809084 0.303623 ... 0.807760 1.862942e+07 0.623915 0.607946 0.840402 0.280365 0.027541 0.565358 1.0 0.047578
std 0.176710 0.060686 0.065620 0.061595 0.016934 0.016916 0.013010 0.012869 0.013601 0.011163 ... 0.040332 3.764501e+08 0.012290 0.016934 0.014523 0.014463 0.015668 0.013214 0.0 0.050014
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.0 0.000000
25% 0.000000 0.476527 0.535543 0.527277 0.600445 0.600434 0.998969 0.797386 0.809312 0.303466 ... 0.796750 9.036205e-04 0.623636 0.600443 0.840115 0.276944 0.026791 0.565158 1.0 0.024477
50% 0.000000 0.502706 0.559802 0.552278 0.605997 0.605976 0.999022 0.797464 0.809375 0.303525 ... 0.810619 2.085213e-03 0.623879 0.605998 0.841179 0.278778 0.026808 0.565252 1.0 0.033798
75% 0.000000 0.535563 0.589157 0.584105 0.613914 0.613842 0.999095 0.797579 0.809469 0.303585 ... 0.826455 5.269777e-03 0.624168 0.613913 0.842357 0.281449 0.026913 0.565725 1.0 0.052838
max 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 ... 1.000000 9.820000e+09 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.0 1.000000

8 rows × 96 columns

In [6]:
data.describe().T
Out[6]:
count mean std min 25% 50% 75% max
Bankrupt? 6819.0 0.032263 0.176710 0.0 0.000000 0.000000 0.000000 1.0
ROA(C) before interest and depreciation before interest 6819.0 0.505180 0.060686 0.0 0.476527 0.502706 0.535563 1.0
ROA(A) before interest and % after tax 6819.0 0.558625 0.065620 0.0 0.535543 0.559802 0.589157 1.0
ROA(B) before interest and depreciation after tax 6819.0 0.553589 0.061595 0.0 0.527277 0.552278 0.584105 1.0
Operating Gross Margin 6819.0 0.607948 0.016934 0.0 0.600445 0.605997 0.613914 1.0
... ... ... ... ... ... ... ... ...
Liability to Equity 6819.0 0.280365 0.014463 0.0 0.276944 0.278778 0.281449 1.0
Degree of Financial Leverage (DFL) 6819.0 0.027541 0.015668 0.0 0.026791 0.026808 0.026913 1.0
Interest Coverage Ratio (Interest expense to EBIT) 6819.0 0.565358 0.013214 0.0 0.565158 0.565252 0.565725 1.0
Net Income Flag 6819.0 1.000000 0.000000 1.0 1.000000 1.000000 1.000000 1.0
Equity to Liability 6819.0 0.047578 0.050014 0.0 0.024477 0.033798 0.052838 1.0

96 rows × 8 columns

Feature Selection

In [7]:
#Check for Missing Values
plt.figure(figsize=(16,12))
sns.heatmap(data.isnull(), cmap = 'magma')
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x176b130c048>
In [1]:
#Check if there are any duplicate rows
data.duplicated(keep=False).sum()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-1-1abe060351e3> in <module>
      1 #Check if there are any duplicate rows
----> 2 data.duplicated(keep=False).sum()

NameError: name 'data' is not defined

Removing Spaces in Column name

In [9]:
print("Column names before renaming","\n", data.columns[:5],"\n")
data.columns = data.columns.str.strip()
data.columns = data.columns.str.replace(" " ,"_")
data.rename(columns = {'Bankrupt?' :'Bankrupt' },inplace=True)
print("Column names after renaming","\n",data.columns[:5])
Column names before renaming 
 Index(['Bankrupt?', ' ROA(C) before interest and depreciation before interest',
       ' ROA(A) before interest and % after tax',
       ' ROA(B) before interest and depreciation after tax',
       ' Operating Gross Margin'],
      dtype='object') 

Column names after renaming 
 Index(['Bankrupt', 'ROA(C)_before_interest_and_depreciation_before_interest',
       'ROA(A)_before_interest_and_%_after_tax',
       'ROA(B)_before_interest_and_depreciation_after_tax',
       'Operating_Gross_Margin'],
      dtype='object')
In [10]:
data.duplicated(keep=False).sum()  ## no duplicates
Out[10]:
0

Identifying Columns with Single Value and eliminating them using VarianceThreshold

In [11]:
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
In [13]:
columns=data.columns
In [15]:
columns
Out[15]:
Index(['Bankrupt', 'ROA(C)_before_interest_and_depreciation_before_interest',
       'ROA(A)_before_interest_and_%_after_tax',
       'ROA(B)_before_interest_and_depreciation_after_tax',
       'Operating_Gross_Margin', 'Realized_Sales_Gross_Margin',
       'Operating_Profit_Rate', 'Pre-tax_net_Interest_Rate',
       'After-tax_net_Interest_Rate',
       'Non-industry_income_and_expenditure/revenue',
       'Continuous_interest_rate_(after_tax)', 'Operating_Expense_Rate',
       'Research_and_development_expense_rate', 'Cash_flow_rate',
       'Interest-bearing_debt_interest_rate', 'Tax_rate_(A)',
       'Net_Value_Per_Share_(B)', 'Net_Value_Per_Share_(A)',
       'Net_Value_Per_Share_(C)', 'Persistent_EPS_in_the_Last_Four_Seasons',
       'Cash_Flow_Per_Share', 'Revenue_Per_Share_(Yuan_Â¥)',
       'Operating_Profit_Per_Share_(Yuan_Â¥)',
       'Per_Share_Net_profit_before_tax_(Yuan_Â¥)',
       'Realized_Sales_Gross_Profit_Growth_Rate',
       'Operating_Profit_Growth_Rate', 'After-tax_Net_Profit_Growth_Rate',
       'Regular_Net_Profit_Growth_Rate', 'Continuous_Net_Profit_Growth_Rate',
       'Total_Asset_Growth_Rate', 'Net_Value_Growth_Rate',
       'Total_Asset_Return_Growth_Rate_Ratio', 'Cash_Reinvestment_%',
       'Current_Ratio', 'Quick_Ratio', 'Interest_Expense_Ratio',
       'Total_debt/Total_net_worth', 'Debt_ratio_%', 'Net_worth/Assets',
       'Long-term_fund_suitability_ratio_(A)', 'Borrowing_dependency',
       'Contingent_liabilities/Net_worth', 'Operating_profit/Paid-in_capital',
       'Net_profit_before_tax/Paid-in_capital',
       'Inventory_and_accounts_receivable/Net_value', 'Total_Asset_Turnover',
       'Accounts_Receivable_Turnover', 'Average_Collection_Days',
       'Inventory_Turnover_Rate_(times)', 'Fixed_Assets_Turnover_Frequency',
       'Net_Worth_Turnover_Rate_(times)', 'Revenue_per_person',
       'Operating_profit_per_person', 'Allocation_rate_per_person',
       'Working_Capital_to_Total_Assets', 'Quick_Assets/Total_Assets',
       'Current_Assets/Total_Assets', 'Cash/Total_Assets',
       'Quick_Assets/Current_Liability', 'Cash/Current_Liability',
       'Current_Liability_to_Assets', 'Operating_Funds_to_Liability',
       'Inventory/Working_Capital', 'Inventory/Current_Liability',
       'Current_Liabilities/Liability', 'Working_Capital/Equity',
       'Current_Liabilities/Equity', 'Long-term_Liability_to_Current_Assets',
       'Retained_Earnings_to_Total_Assets', 'Total_income/Total_expense',
       'Total_expense/Assets', 'Current_Asset_Turnover_Rate',
       'Quick_Asset_Turnover_Rate', 'Working_capitcal_Turnover_Rate',
       'Cash_Turnover_Rate', 'Cash_Flow_to_Sales', 'Fixed_Assets_to_Assets',
       'Current_Liability_to_Liability', 'Current_Liability_to_Equity',
       'Equity_to_Long-term_Liability', 'Cash_Flow_to_Total_Assets',
       'Cash_Flow_to_Liability', 'CFO_to_Assets', 'Cash_Flow_to_Equity',
       'Current_Liability_to_Current_Assets', 'Liability-Assets_Flag',
       'Net_Income_to_Total_Assets', 'Total_assets_to_GNP_price',
       'No-credit_Interval', 'Gross_Profit_to_Sales',
       'Net_Income_to_Stockholder's_Equity', 'Liability_to_Equity',
       'Degree_of_Financial_Leverage_(DFL)',
       'Interest_Coverage_Ratio_(Interest_expense_to_EBIT)', 'Net_Income_Flag',
       'Equity_to_Liability'],
      dtype='object')
In [16]:
df=data
In [17]:
df
Out[17]:
Bankrupt ROA(C)_before_interest_and_depreciation_before_interest ROA(A)_before_interest_and_%_after_tax ROA(B)_before_interest_and_depreciation_after_tax Operating_Gross_Margin Realized_Sales_Gross_Margin Operating_Profit_Rate Pre-tax_net_Interest_Rate After-tax_net_Interest_Rate Non-industry_income_and_expenditure/revenue ... Net_Income_to_Total_Assets Total_assets_to_GNP_price No-credit_Interval Gross_Profit_to_Sales Net_Income_to_Stockholder's_Equity Liability_to_Equity Degree_of_Financial_Leverage_(DFL) Interest_Coverage_Ratio_(Interest_expense_to_EBIT) Net_Income_Flag Equity_to_Liability
0 1 0.370594 0.424389 0.405750 0.601457 0.601457 0.998969 0.796887 0.808809 0.302646 ... 0.716845 0.009219 0.622879 0.601453 0.827890 0.290202 0.026601 0.564050 1 0.016469
1 1 0.464291 0.538214 0.516730 0.610235 0.610235 0.998946 0.797380 0.809301 0.303556 ... 0.795297 0.008323 0.623652 0.610237 0.839969 0.283846 0.264577 0.570175 1 0.020794
2 1 0.426071 0.499019 0.472295 0.601450 0.601364 0.998857 0.796403 0.808388 0.302035 ... 0.774670 0.040003 0.623841 0.601449 0.836774 0.290189 0.026555 0.563706 1 0.016474
3 1 0.399844 0.451265 0.457733 0.583541 0.583541 0.998700 0.796967 0.808966 0.303350 ... 0.739555 0.003252 0.622929 0.583538 0.834697 0.281721 0.026697 0.564663 1 0.023982
4 1 0.465022 0.538432 0.522298 0.598783 0.598783 0.998973 0.797366 0.809304 0.303475 ... 0.795016 0.003878 0.623521 0.598782 0.839973 0.278514 0.024752 0.575617 1 0.035490
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
6814 0 0.493687 0.539468 0.543230 0.604455 0.604462 0.998992 0.797409 0.809331 0.303510 ... 0.799927 0.000466 0.623620 0.604455 0.840359 0.279606 0.027064 0.566193 1 0.029890
6815 0 0.475162 0.538269 0.524172 0.598308 0.598308 0.998992 0.797414 0.809327 0.303520 ... 0.799748 0.001959 0.623931 0.598306 0.840306 0.278132 0.027009 0.566018 1 0.038284
6816 0 0.472725 0.533744 0.520638 0.610444 0.610213 0.998984 0.797401 0.809317 0.303512 ... 0.797778 0.002840 0.624156 0.610441 0.840138 0.275789 0.026791 0.565158 1 0.097649
6817 0 0.506264 0.559911 0.554045 0.607850 0.607850 0.999074 0.797500 0.809399 0.303498 ... 0.811808 0.002837 0.623957 0.607846 0.841084 0.277547 0.026822 0.565302 1 0.044009
6818 0 0.493053 0.570105 0.549548 0.627409 0.627409 0.998080 0.801987 0.813800 0.313415 ... 0.815956 0.000707 0.626680 0.627408 0.841019 0.275114 0.026793 0.565167 1 0.233902

6819 rows × 96 columns

In [19]:
var_thres=VarianceThreshold(threshold=0)
var_thres.fit(data)
Out[19]:
VarianceThreshold(threshold=0)
In [20]:
var_thres.get_support()
Out[20]:
array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True, False,  True])
In [21]:
sum(var_thres.get_support())
Out[21]:
95
In [23]:
len(data.columns[var_thres.get_support()])
Out[23]:
95
In [25]:
constant_columns = [column for column in data.columns
                    if column not in data.columns[var_thres.get_support()]]

print(len(constant_columns))
1
In [26]:
for column in constant_columns:
    print(column)
Net_Income_Flag
In [31]:
data=data.drop(constant_columns,axis=1)

Using Correlation To Identify Important Features

In [44]:
import seaborn as sns
#Using Pearson Correlation
corrmat = data.corr()
fig, ax = plt.subplots()
fig.set_size_inches(11,11)
sns.heatmap(corrmat)
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x176b6a51d88>
In [46]:
def correlation(dataset, threshold):
    col_corr = set()  # Set of all the names of correlated columns
    corr_matrix = dataset.corr()
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if abs(corr_matrix.iloc[i, j]) > threshold: # we are interested in absolute coeff value
                colname = corr_matrix.columns[i]  # getting the name of column
                col_corr.add(colname)
    return col_corr
In [52]:
corr_features = correlation(data, 0.9)
len(set(corr_features))
Out[52]:
19
In [53]:
corr_features
Out[53]:
{'After-tax_net_Interest_Rate',
 'Cash_Flow_to_Sales',
 'Continuous_interest_rate_(after_tax)',
 'Current_Liability_to_Equity',
 'Current_Liability_to_Liability',
 'Gross_Profit_to_Sales',
 'Liability_to_Equity',
 'Net_Income_to_Total_Assets',
 'Net_Value_Per_Share_(A)',
 'Net_Value_Per_Share_(C)',
 'Net_profit_before_tax/Paid-in_capital',
 'Net_worth/Assets',
 'Operating_profit/Paid-in_capital',
 'Per_Share_Net_profit_before_tax_(Yuan_Â¥)',
 'Pre-tax_net_Interest_Rate',
 'ROA(A)_before_interest_and_%_after_tax',
 'ROA(B)_before_interest_and_depreciation_after_tax',
 'Realized_Sales_Gross_Margin',
 'Regular_Net_Profit_Growth_Rate'}
In [54]:
data=data.drop(corr_features,axis=1)
In [55]:
data.shape
Out[55]:
(6819, 76)
In [57]:
data['Bankrupt'].unique()
Out[57]:
array([1, 0], dtype=int64)
In [58]:
### Train test split to avoid overfitting
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(data.drop(labels=['Bankrupt'], axis=1),
    df['Bankrupt'],
    test_size=0.3,
    random_state=0)
In [59]:
from sklearn.feature_selection import mutual_info_classif
# determine the mutual information
mutual_info = mutual_info_classif(X_train, y_train)
mutual_info
Out[59]:
array([2.85819581e-02, 1.67067106e-02, 2.04888531e-02, 3.19904898e-02,
       1.75410538e-03, 3.20556828e-05, 1.37990651e-02, 2.53300947e-03,
       1.68455505e-02, 3.16531125e-02, 4.28292788e-02, 5.61138436e-03,
       1.22360638e-03, 2.42613460e-02, 8.63141207e-03, 6.65680843e-03,
       1.20528604e-02, 1.10315419e-02, 1.17782652e-02, 3.55110218e-03,
       1.25280002e-02, 1.59331758e-03, 2.52960589e-02, 2.65987575e-02,
       3.45523363e-02, 3.45925369e-02, 3.57018490e-02, 1.05099357e-02,
       3.99168512e-02, 3.72632687e-03, 4.45861891e-03, 1.04993509e-02,
       0.00000000e+00, 4.94125033e-03, 2.02723563e-03, 5.18807197e-03,
       2.26770492e-04, 0.00000000e+00, 2.41801197e-02, 7.06248550e-03,
       2.65019169e-02, 5.76041803e-03, 7.12647241e-04, 1.67924075e-02,
       2.12298268e-02, 1.49178508e-02, 1.61032797e-02, 1.73149134e-02,
       2.30699758e-02, 2.70612427e-03, 3.05529702e-03, 2.23208706e-02,
       2.39963325e-02, 5.47623151e-03, 3.45783255e-02, 3.41727610e-02,
       9.07947884e-03, 0.00000000e+00, 1.00267560e-03, 1.92357967e-02,
       1.98963266e-03, 4.44445006e-03, 1.42691503e-02, 5.73793563e-03,
       9.42270718e-03, 8.17357567e-03, 8.35080177e-03, 2.70556816e-02,
       1.63256316e-03, 1.13871015e-03, 7.57730017e-03, 4.20451791e-02,
       3.27411959e-02, 3.34902545e-02, 3.52018321e-02])
In [60]:
mutual_info = pd.Series(mutual_info)
mutual_info.index = X_train.columns
mutual_info.sort_values(ascending=False)
Out[60]:
Persistent_EPS_in_the_Last_Four_Seasons    0.042829
Net_Income_to_Stockholder's_Equity         0.042045
Borrowing_dependency                       0.039917
Debt_ratio_%                               0.035702
Equity_to_Liability                        0.035202
                                             ...   
Net_Worth_Turnover_Rate_(times)            0.000227
Research_and_development_expense_rate      0.000032
Accounts_Receivable_Turnover               0.000000
Current_Asset_Turnover_Rate                0.000000
Revenue_per_person                         0.000000
Length: 75, dtype: float64
In [61]:
mutual_info.sort_values(ascending=False).plot.bar(figsize=(20, 8))
Out[61]:
<matplotlib.axes._subplots.AxesSubplot at 0x176b710d048>
In [62]:
from sklearn.feature_selection import SelectKBest
In [66]:
sel_five_cols = SelectKBest(mutual_info_classif, k=70)
sel_five_cols.fit(X_train, y_train)
c=X_train.columns[sel_five_cols.get_support()]
In [68]:
df=df[c]
In [69]:
df.shape
Out[69]:
(6819, 70)
In [70]:
df.head()
Out[70]:
ROA(C)_before_interest_and_depreciation_before_interest Operating_Gross_Margin Operating_Profit_Rate Non-industry_income_and_expenditure/revenue Operating_Expense_Rate Research_and_development_expense_rate Cash_flow_rate Interest-bearing_debt_interest_rate Tax_rate_(A) Net_Value_Per_Share_(B) ... CFO_to_Assets Cash_Flow_to_Equity Current_Liability_to_Current_Assets Liability-Assets_Flag Total_assets_to_GNP_price No-credit_Interval Net_Income_to_Stockholder's_Equity Degree_of_Financial_Leverage_(DFL) Interest_Coverage_Ratio_(Interest_expense_to_EBIT) Equity_to_Liability
0 0.370594 0.601457 0.998969 0.302646 1.256969e-04 0.0 0.458143 0.000725 0.0 0.147950 ... 0.520382 0.312905 0.118250 0 0.009219 0.622879 0.827890 0.026601 0.564050 0.016469
1 0.464291 0.610235 0.998946 0.303556 2.897851e-04 0.0 0.461867 0.000647 0.0 0.182251 ... 0.567101 0.314163 0.047775 0 0.008323 0.623652 0.839969 0.264577 0.570175 0.020794
2 0.426071 0.601450 0.998857 0.302035 2.361297e-04 25500000.0 0.458521 0.000790 0.0 0.177911 ... 0.538491 0.314515 0.025346 0 0.040003 0.623841 0.836774 0.026555 0.563706 0.016474
3 0.399844 0.583541 0.998700 0.303350 1.078888e-04 0.0 0.465705 0.000449 0.0 0.154187 ... 0.604105 0.302382 0.067250 0 0.003252 0.622929 0.834697 0.026697 0.564663 0.023982
4 0.465022 0.598783 0.998973 0.303475 7.890000e+09 0.0 0.462746 0.000686 0.0 0.167502 ... 0.578469 0.311567 0.047725 0 0.003878 0.623521 0.839973 0.024752 0.575617 0.035490

5 rows × 70 columns

In [74]:
df = pd.concat([df,data['Bankrupt']], axis=1).reindex(df.index)
In [75]:
df.shape
Out[75]:
(6819, 71)
In [76]:
df['Bankrupt']
Out[76]:
0       1
1       1
2       1
3       1
4       1
       ..
6814    0
6815    0
6816    0
6817    0
6818    0
Name: Bankrupt, Length: 6819, dtype: int64
In [77]:
data=df

Getting rid of Outliers

  • First separate all 71 features into two groups

    • fraction-only features (i.e. features having values in [0,1])
    • other than fraction-only features
  • 49 features are fraction-only features where as 24 are other than fraction-only features.

  • Outliers are mainly present in these 21 "other than fraction-only" features

  • To explore the outliers nature, distribution of these 24 features, are obtained using :

    • Histogram
    • Boxplot
In [82]:
## fn to separate only-fractional & other columns
def get_fraction_valued_columns(df):
    my_columns  = []
    for col in df.columns:
        if (data[col].max()<=1) & (data[col].min() >= 0):
            my_columns.append(col)
    return(my_columns)

fractional_columns = get_fraction_valued_columns(df=data.drop(['Bankrupt'],axis=1))
non_fraction_columns = data.drop(['Bankrupt'],axis=1).columns.difference(fractional_columns)
print("# Fraction-only Columns",len(fractional_columns),"\t","# Other than Fraction-only Columns", len(non_fraction_columns))
# Fraction-only Columns 49 	 # Other than Fraction-only Columns 21
In [83]:
data[non_fraction_columns].hist(figsize= (20,20),sharex=True,layout= (6,4))
plt.show()
In [84]:
data[non_fraction_columns].boxplot(vert=False,figsize= (15,10))
plt.subplots_adjust(left=0.25)
plt.show()
In [85]:
log_transformed_cols = []
for col in data[non_fraction_columns].columns:
    if (data[col].quantile(1) >= 100* data[col].quantile(0.99)) |  (sum(data[col] > data[col].quantile(0.99)) <= 10):
        data[col] = np.log1p(data[col])
        log_transformed_cols.append(col)
        
## Change names of log transformed column
log_names = "log_" + data[log_transformed_cols].columns
data.rename(columns={data[log_transformed_cols].columns[i]: log_names[i] for i in range(len(log_names))}, inplace = True)
In [86]:
print("The following features are log transformed after they fulfill outlier detection condition.","\n\n",log_transformed_cols)
data[log_names].boxplot(vert=False,figsize= (15,10))
plt.subplots_adjust(left=0.25)
plt.title("Boxplot of Outlier infected features after log transformation")
plt.show()
The following features are log transformed after they fulfill outlier detection condition. 

 ['Allocation_rate_per_person', 'Average_Collection_Days', 'Cash/Current_Liability', 'Current_Ratio', 'Fixed_Assets_to_Assets', 'Net_Value_Growth_Rate', 'Quick_Assets/Current_Liability', 'Quick_Ratio', 'Revenue_Per_Share_(Yuan_Â¥)', 'Revenue_per_person', 'Total_assets_to_GNP_price', 'Total_debt/Total_net_worth']
In [87]:
df1 = pd.DataFrame(data.Bankrupt.value_counts())
df2 = pd.DataFrame(100*data.Bankrupt.value_counts(normalize=True).astype(float))
tab = df1.merge(df2,left_index=True,right_index=True).rename(columns = {"Bankrupt_x" : "Count" , "Bankrupt_y" : "Percentage"})
print(tab)
   Count  Percentage
0   6599    96.77372
1    220     3.22628
In [88]:
plt.pie(tab['Count'], labels= [0,1])
Out[88]:
([<matplotlib.patches.Wedge at 0x176b8df52c8>,
  <matplotlib.patches.Wedge at 0x176b8df5ac8>],
 [Text(-1.0943545992823505, 0.11130144217199622, '0'),
  Text(1.094354596677152, -0.11130146778721411, '1')])
In [98]:
data.to_csv('New_data.csv')

EDA using AutoViz

In [97]:
from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()
df = AV.AutoViz('New_data.csv')
Shape of your Data Set: (6819, 72)
############## C L A S S I F Y I N G  V A R I A B L E S  ####################
Classifying variables in data set...
    Number of Numeric Columns =  69
    Number of Integer-Categorical Columns =  0
    Number of String-Categorical Columns =  0
    Number of Factor-Categorical Columns =  0
    Number of String-Boolean Columns =  0
    Number of Numeric-Boolean Columns =  2
    Number of Discrete String Columns =  0
    Number of NLP String Columns =  0
    Number of Date Time Columns =  0
    Number of ID Columns =  1
    Number of Columns to Delete =  0
    72 Predictors classified...
        This does not include the Target column(s)
        1 variables removed since they were ID or low-information variables
30 numeric variables in data exceeds limit, taking top 30 variables
Number of All Scatter Plots = 465
Time to run AutoViz (in seconds) = 225.686

 ###################### VISUALIZATION Completed ########################
In [ ]:
# EDA using Sweetviz
In [143]:
import sweetviz as sv
sweet_report = sv.analyze(data)
sweet_report.show_html('sweet_report.html')
Report sweet_report.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
In [100]:
y = data['Bankrupt']
X = data.drop(['Bankrupt'], axis = 1)
In [101]:
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, chi2 , mutual_info_classif 
from imblearn.over_sampling import SMOTE 
In [102]:
smote = SMOTE(sampling_strategy='minority')
X_sm, y_sm = smote.fit_resample(X,y)
y_sm.value_counts()
Out[102]:
1    6599
0    6599
Name: Bankrupt, dtype: int64
In [103]:
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.2, random_state = 101, stratify = y_sm)
In [108]:
from sklearn.model_selection import train_test_split,StratifiedKFold
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
In [109]:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

Implementing Grid Search CV for 4 models(SVM,Random Forest, Logistic Regression & K Nearest Neighbours)

In [110]:
from sklearn.model_selection import GridSearchCV
In [111]:
model_params = {
    'svm': {
        'model': SVC(gamma='auto',probability=True),
        'params' : {
            'C': [1,10,20],
            'kernel': ['rbf','linear','sigmoid']
        }  
    },
    'random_forest': {
        'model': RandomForestClassifier(),
        'params' : {
            'n_estimators': [1,5,10,20,30,50],
            'criterion':['gini','entropy']
        }
    },
    'logistic_regression' : {
        'model': LogisticRegression(multi_class='auto'),
        'params': {
            'C': [1,5,10],
            'solver':['lbfgs','liblinear']
        }
    },
   'KNN': {
        'model':KNeighborsClassifier(),
        'params':{
            'n_neighbors' : [1,3,5,7],
            'algorithm':['auto','kd_tree']
        }
    }
}
In [112]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_sm)
In [113]:
scores = []

for model_name, mp in model_params.items():
    clf =  GridSearchCV(mp['model'], mp['params'], cv=3, verbose=3, n_jobs=-1, scoring = 'recall', return_train_score=False)
    clf.fit(X_scaled, y_sm)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })
    
df = pd.DataFrame(scores,columns=['model','best_score','best_params'])
df
Fitting 3 folds for each of 9 candidates, totalling 27 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  22 out of  27 | elapsed:  3.8min remaining:   51.2s
[Parallel(n_jobs=-1)]: Done  27 out of  27 | elapsed:  6.7min finished
Fitting 3 folds for each of 12 candidates, totalling 36 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:    2.8s
[Parallel(n_jobs=-1)]: Done  34 out of  36 | elapsed:    8.7s remaining:    0.4s
[Parallel(n_jobs=-1)]: Done  36 out of  36 | elapsed:    9.3s finished
Fitting 3 folds for each of 6 candidates, totalling 18 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  10 out of  18 | elapsed:    2.3s remaining:    1.8s
[Parallel(n_jobs=-1)]: Done  18 out of  18 | elapsed:    5.9s finished
Fitting 3 folds for each of 8 candidates, totalling 24 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  18 out of  24 | elapsed:   36.5s remaining:   12.1s
[Parallel(n_jobs=-1)]: Done  24 out of  24 | elapsed:   40.4s finished
Out[113]:
model best_score best_params
0 svm 0.994697 {'C': 20, 'kernel': 'rbf'}
1 random_forest 0.994393 {'criterion': 'entropy', 'n_estimators': 30}
2 logistic_regression 0.929081 {'C': 10, 'solver': 'liblinear'}
3 KNN 0.999697 {'algorithm': 'auto', 'n_neighbors': 3}

SVM

In [114]:
model_svm = SVC(gamma='auto',C=20, kernel='rbf', probability=True )
model_svm.fit(X_train,y_train)
svm_predictions = model_svm.predict(X_test)
In [115]:
#Printing Confusion Matrix
pd.DataFrame(confusion_matrix(y_test,svm_predictions))
Out[115]:
0 1
0 1276 44
1 4 1316
In [116]:
print(classification_report(y_test,svm_predictions))
              precision    recall  f1-score   support

           0       1.00      0.97      0.98      1320
           1       0.97      1.00      0.98      1320

    accuracy                           0.98      2640
   macro avg       0.98      0.98      0.98      2640
weighted avg       0.98      0.98      0.98      2640

RandomForestClassifier

In [120]:
rfc = RandomForestClassifier(n_estimators=30, criterion='entropy' )
rfc.fit(X_train, y_train)
rfc_pred = rfc.predict(X_test)
In [121]:
#Printing Confusion Matrix
pd.DataFrame(confusion_matrix(y_test,rfc_pred))
Out[121]:
0 1
0 1274 46
1 9 1311
In [122]:
print(classification_report(y_test,rfc_pred))
              precision    recall  f1-score   support

           0       0.99      0.97      0.98      1320
           1       0.97      0.99      0.98      1320

    accuracy                           0.98      2640
   macro avg       0.98      0.98      0.98      2640
weighted avg       0.98      0.98      0.98      2640

LogisticRegression

In [123]:
logmodel = LogisticRegression(solver='liblinear',multi_class='auto', C = 10)
logmodel.fit(X_train,y_train)
log_predictions = logmodel.predict(X_test)
In [124]:
pd.DataFrame(confusion_matrix(y_test,log_predictions))
Out[124]:
0 1
0 1184 136
1 123 1197
In [125]:
print(classification_report(y_test,log_predictions))
              precision    recall  f1-score   support

           0       0.91      0.90      0.90      1320
           1       0.90      0.91      0.90      1320

    accuracy                           0.90      2640
   macro avg       0.90      0.90      0.90      2640
weighted avg       0.90      0.90      0.90      2640

KNeighborsClassifier

In [126]:
knn = KNeighborsClassifier(n_neighbors=3,algorithm = 'auto')
knn.fit(X_train,y_train)
knn_pred = knn.predict(X_test)
In [127]:
pd.DataFrame(confusion_matrix(y_test,knn_pred))
Out[127]:
0 1
0 1209 111
1 0 1320
In [128]:
print(classification_report(y_test,knn_pred))
              precision    recall  f1-score   support

           0       1.00      0.92      0.96      1320
           1       0.92      1.00      0.96      1320

    accuracy                           0.96      2640
   macro avg       0.96      0.96      0.96      2640
weighted avg       0.96      0.96      0.96      2640

ROC Curve and Area Under the Curve

In [129]:
from sklearn.metrics import roc_curve, auc

ROC - Logistic

In [130]:
y_pred_logistic = logmodel.predict_proba(X_test)[:,1]
logistic_fpr, logistic_tpr, threshold = roc_curve(y_test, y_pred_logistic)
auc_logistic = auc(logistic_fpr, logistic_tpr)

ROC - Random Forest

In [131]:
y_pred_rfc = rfc.predict_proba(X_test)[:,1]
rfc_fpr, rfc_tpr, threshold = roc_curve(y_test, y_pred_rfc)
auc_rfc = auc(rfc_fpr, rfc_tpr)

ROC - SVM

In [132]:
y_pred_svm = model_svm.predict_proba(X_test)[:,1]
svm_fpr, svm_tpr, threshold = roc_curve(y_test, y_pred_svm)
auc_svm = auc(svm_fpr, svm_tpr)

ROC - KNN

In [133]:
y_pred_knn = knn.predict_proba(X_test)[:,1]
knn_fpr, knn_tpr, threshold = roc_curve(y_test, y_pred_knn)
auc_knn = auc(knn_fpr, knn_tpr)
In [134]:
plt.figure(figsize=(5, 5), dpi=100)
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(rfc_fpr, rfc_tpr, linestyle='-', label='RFC (auc = %0.3f)' % auc_rfc)
plt.plot(logistic_fpr, logistic_tpr, marker='.', label='Logistic (auc = %0.3f)' % auc_logistic)
plt.plot(svm_fpr, svm_tpr, marker='+', label='SVM (auc = %0.3f)' % auc_svm)
plt.plot(knn_fpr, knn_tpr, linestyle='-', label='KNN (auc = %0.3f)' % auc_knn)


plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

plt.legend(loc='best')
Out[134]:
<matplotlib.legend.Legend at 0x176b6ffa888>

Saving All Models

In [139]:
import pickle

filename = 'SVM-model.pkl'
pickle.dump(model_svm, open(filename, 'wb'))
In [140]:
import pickle

filename = 'Logistic-model.pkl'
pickle.dump(logmodel, open(filename, 'wb'))
In [141]:
import pickle

filename = 'KNN-model.pkl'
pickle.dump(knn, open(filename, 'wb'))
In [142]:
import pickle

filename = 'Random-Forest-model.pkl'
pickle.dump(rfc, open(filename, 'wb'))
In [2]:
pip install numpy
Note: you may need to restart the kernel to use updated packages.Requirement already satisfied: numpy in c:\users\asus\anaconda3\envs\tensorflow\lib\site-packages (1.19.5)

ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: 'c:\\users\\asus\\anaconda3\\envs\\tensorflow\\lib\\site-packages\\numpy-1.19.5.dist-info\\METADATA'